import hw2_2
import hw5_2
import hw9_1
$\text{Altyn Rymbek $^{1}$, Ruilin Qian $^{2}$, Ziyang Qin $^{2*}$}$
$^1$ Department of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States \ $^2$ Department of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States \ $^*$ Correspondence: azhelamb@caltech.edu; rqian@caltech.edu; zqin@caltech.edu
$^ \textbf{_______________________________________________________________________________________________________________________________________________}$ $\textbf{ABSTRACT:}$ Microtubules are polymers of tubulin that form part of the cytoskeleton and provide structure and shape to eukaryotic cells. To understand how a microtubule switches from growth to shortening, it is crucial to study the dynamics of microtubule catastrophe. Differential interference contrast (DIC) microscopy was used alongside fluorescence detection of the labeled tubulin to track this process without letting laser light interfere with it. To study this process, we extracted microtubule catastrophe times and gained a better understanding by constructing ECDFs, computing confidence intervals, and running model simulations to find the most suitable model for microtubule catastrophe. $^ \textbf{_______________________________________________________________________________________________________________________________________________}$
$\color{brown}{\textbf{INTRODUCTION}}$
Microtubule catastrophe is a ubiquitous phenomenon in eukaryotic cells and is essential for many cellular processes, such as cell motility and mitosis. However, despite its importance, how this turnover is regulated remains unclear. It was previously reported that this process is regulated by the cellular concentration of GTP, so Gardner and coworkers used fluorescence methods to measure catastrophe times for microtubules under different concentrations of GTP. They also built a mathematical model in which the catastrophe time follows a gamma distribution whose parameters are regulated by [GTP]. This work was published in Cell.
In this work, we first present a summary of their findings. Second, we suspected that a better distribution model than their gamma model might exist, namely the double exponential model, so we used maximum likelihood estimation (MLE) methods to compare the two models. According to our results, the gamma distribution remains the better fit.
Gardner's work shows that with a higher [GTP] in the system, the microtubule tends to remain stable, and catastrophe happens much later than under low [GTP]. Their work reveals a significant regulatory pathway for this important biological process, and their mathematical model helps us understand the dynamics of microtubules in vivo. Since many diseases are related to the behavior of cellular filaments such as microtubules, their work is also important for developing future treatments.
$\color{brown}{\textbf{RESULTS \& DISCUSSION}}$
$\color{blue}{\textbf{Microtubule Catastrophe Requires Multiple Steps}}$
The empirical cumulative distribution function (ECDF) is a powerful tool for understanding the overall distribution of a data set. In our previous work, we defined ECDF(x) as the fraction of data points ≤ $\text{x}$. We implemented this computation by iterating through the elements of the provided Numpy array of x values, comparing the whole array against the current element to obtain a boolean array where 'True' indicates an element less than or equal to the current one. Since 'True' counts as 1 and 'False' as 0, summing these boolean values and, after exiting the loop, dividing by the length of the array gives the ECDF value at each point. The function returns two Numpy arrays of x and y values, and this is the ECDF computation used throughout the following work.
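A minimal sketch of this ECDF computation (the function name `ecdf` is ours; sorting is an equivalent, faster alternative to the pairwise-comparison loop described above and yields the same values):

```python
import numpy as np

def ecdf(data):
    """Return x and y values of the ECDF of a 1D array.

    Sorting is an O(n log n) equivalent of the O(n^2) loop of
    boolean comparisons described in the text: after sorting,
    the fraction of points <= the i-th value is (i+1)/n.
    """
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y

# Example: y is the fraction of data points <= each sorted x
x, y = ecdf(np.array([3.0, 1.0, 2.0]))
print(x)  # [1. 2. 3.]
print(y)  # fractions 1/3, 2/3, 1
```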
To initiate our analysis, we needed to rule out the possibility that microtubule aging results from light-induced damage after fluorescent labeling. As shown in $\textbf{Figure 1}$, labeled and unlabeled tubulin show similar catastrophe times, indicating that fluorescent labeling did not significantly affect microtubule dynamics during TIRF imaging. Therefore, we were able to conduct further exploration of microtubule dynamics.
hw2_2.plot_ecdf()
Figure 1 ECDF of time to catastrophe (s) for fluorescently labeled and unlabeled cells
Having established that the method is sound, we turned to how the time to microtubule catastrophe is distributed, which gives a hint about the dynamics of catastrophe. In $\textbf{Figure 1}$, we plotted the ECDF against time to catastrophe (s). Both labeled and unlabeled tubulin show clear nonlinearity at early time points and fall well below the curve predicted by a constant catastrophe rate, indicating that catastrophe is not a single-rate process.
$\color{blue}{\textbf{Model construction: how we chose the model and verified it}}$
Further analysis led us to conclude that, since the cumulative distribution does not increase linearly at early times, the process cannot be first-order (a single-step process). Instead, catastrophe appears to be a multi-step process with n sequential steps, each happening independently at a slow rate. Such steps can be treated as Poisson processes, and mathematically, the waiting time of each Poisson step is exponentially distributed. Considering these arguments, we concluded that the time to catastrophe should be a sum of exponentially distributed waiting times. To test this idea, we first simulated the model as a Gamma distribution, which describes n identical Poisson steps; however, the subsequent fitting did not fully support this assumption. Since the process is multi-step with exponentially distributed steps, we then hypothesized a 'double exponential' distribution consisting of two successive Poisson processes with different rates. We therefore began by constructing the PDF and CDF of this target distribution mathematically.
For two sequential Poisson processes, it is easiest to use a convolution to calculate the overall PDF. Since the total time of the two processes is $t = T_1 + T_2$, we have $T_2 = t - T_1$, with $T_1 > 0$, $T_2 > 0$, and $t = T_1 + T_2 > 0$. The PDF of the distribution is therefore:
\begin{align} f(t; \beta_1, \beta_2) = \int_{0}^{t} f_{\beta_1}(T_1) \ f_{\beta_2}(t - T_1) \ dT_1 \end{align}After an instructive mathematical simplification, we obtained the PDF and CDF of the distribution as follows:\begin{align} f(t; \beta_1, \beta_2) &= \frac{\beta_1 \beta_2}{\beta_2 - \beta_1}\left(\mathrm{e}^{-\beta_1 t} - \mathrm{e}^{-\beta_2 t}\right)\\ F(t; \beta_1, \beta_2) &= 1 - \frac{\beta_2 \mathrm{e}^{-\beta_1 t} - \beta_1 \mathrm{e}^{-\beta_2 t}}{\beta_2 - \beta_1} \end{align}
With the expressions for the PDF and CDF in hand, we continued our investigation by simulating the model with random number generation and comparing it with the theoretical distribution. When doing the simulation, we found that the more random numbers we generated, the more accurate the simulation. As can be seen in $\textbf{Figure 2}$ and $\textbf{Figure 3}$, 150 random numbers are not enough to simulate such a multi-step process, while 2000 random numbers are. Based on this finding, we concluded that two successive Poisson processes can be an excellent way to describe the dynamics of catastrophe.
hw5_2.plot_double_exp_simulation()
Figure 2 150 random numbers used to simulate the distribution
hw5_2.plot_double_exp_simulation_2000()
Figure 3 2000 random numbers used to simulate the distribution
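The simulations behind Figures 2 and 3 can be sketched as follows (a hedged reimplementation, not the `hw5_2` code itself; the rate values and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

def draw_double_exp(beta_1, beta_2, size):
    """Draw catastrophe times as the sum of two successive
    exponentially distributed (Poisson-process) waiting times
    with rates beta_1 and beta_2."""
    t1 = rng.exponential(1 / beta_1, size=size)
    t2 = rng.exponential(1 / beta_2, size=size)
    return t1 + t2

# Fewer samples (150) give a noisier ECDF than 2000 samples,
# mirroring the comparison between Figures 2 and 3
samples_150 = draw_double_exp(0.004, 0.005, 150)
samples_2000 = draw_double_exp(0.004, 0.005, 2000)
```

Comparing the ECDFs of these two arrays against the theoretical CDF reproduces the qualitative effect of sample size seen in the figures.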
$\color{blue}{\textbf{Comparison between two experimental groups to test the precision of our method}}$
To further confirm that our experimental results are not affected by light-induced damage, we separately tested whether the labeled group and the control group share the same distribution. We began by constructing confidence intervals for the two groups. In our case, to see whether the distributions of the catastrophe time for labeled and unlabeled microtubules are identical, the most intuitive approach is to draw their ECDF curves and check whether the two curves match.
import hw6_1
hw6_1.show_ecdfs_with_interval()
Figure 4 ECDF with confidence intervals of time to catastrophe (s) for fluorescently labeled and unlabeled cells
$\textbf{Figure 4}$ clearly shows that there is no significant difference between the two distributions, since they largely overlap. Satisfyingly, when checking the mean catastrophe times of the labeled and control groups, we found that the mean of the control group is consistent with that of the fluorescently labeled group (confidence intervals of the plug-in estimates gave $\text{[355.79, 476.47]}$ for the control group and $\text{[401.14, 481.73]}$ for the labeled group), providing further evidence that the two groups are identically distributed.
After calculating the means with confidence intervals for the two groups, we asked whether the large number of measured catastrophe times would let us determine how the sample mean itself is distributed. According to the central limit theorem, it should be approximately Normally distributed, with mean and variance estimated as $\mu = \overline{x}$, $\sigma^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i-\overline{x})^2 \ $. With these in mind, we computed confidence intervals for the mean based on the Normal distribution, obtaining $\text{[350.84, 474.21]}$ for the control group and $\text{[400.74, 480.68]}$ for the labeled group, consistent with our previous computation. This illustrates what the central limit theorem states and gives a mathematical understanding of how the sample mean is distributed.
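The normal-theory interval described above can be sketched as follows (the helper name `clt_conf_int` and the illustrative data are ours, not the experimental measurements):

```python
import numpy as np
import scipy.stats

def clt_conf_int(data, conf_level=0.95):
    """Confidence interval for the mean using the central limit
    theorem: mu = xbar and sigma^2 = s^2 / n, i.e. the estimated
    variance of the sample mean."""
    mu = np.mean(data)
    sem = np.sqrt(np.var(data, ddof=1) / len(data))
    return scipy.stats.norm.interval(conf_level, loc=mu, scale=sem)

# Illustrative synthetic data with a mean near the measured ~440 s
rng = np.random.default_rng(seed=3252)
times = rng.exponential(440.0, size=150)
low, high = clt_conf_int(times)
```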
In statistics, p-value analysis is an efficient way to make testing decisions. Conventionally, $\text{p}<0.05$ indicates a statistically significant difference, while $\text{p}<0.01$ indicates a highly significant one. In our work, we ran a p-value analysis on the two groups. The result showed that we cannot reject our null hypothesis, which in other words demonstrates that there is $\textbf{no}$ statistically significant difference between the two groups ( $\textbf{Figure 5}$ ).
import bootstrape
Figure 5 ECDF of bootstrap data of time to catastrophe (s) for fluorescently labeled and unlabeled cells
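One common way to obtain such a p-value is a permutation test on the difference of means; the sketch below is our own (whether the `bootstrape` module uses exactly this test statistic is an assumption):

```python
import numpy as np

def perm_test_mean_diff(x, y, n_perm=10_000, seed=3252):
    """Permutation p-value for the null hypothesis that x and y
    come from the same distribution, using |mean(x) - mean(y)|
    as the test statistic."""
    rng = np.random.default_rng(seed)
    observed = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate((x, y))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(np.mean(pooled[:len(x)]) - np.mean(pooled[len(x):]))
        count += stat >= observed
    return count / n_perm

# Two samples from the same distribution: a large p-value is expected
rng = np.random.default_rng(seed=1)
a = rng.exponential(440.0, size=100)
b = rng.exponential(440.0, size=100)
p = perm_test_mean_diff(a, b)
```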
For better visualization, the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality was introduced as a method to compute confidence intervals: it puts an upper bound on the maximum distance between the ECDF and the true CDF, which yields upper and lower bounds forming a confidence band. It is extremely helpful when the overlap is complex, and it gave us the clear graph shown in $\textbf{Figure 6}$, from which we can easily tell the difference and consistency of the two groups (the yellow-blue region and the green-red region) for the 95% confidence interval.
import dkw
dkw.show_dkw()
Figure 6 DKW Inequality confidence intervals for fluorescently labeled and unlabeled cells
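The DKW band can be computed directly: with confidence level $1-\alpha$, the band is the ECDF $\pm \epsilon$ with $\epsilon = \sqrt{\ln(2/\alpha)/(2n)}$, clipped to [0, 1]. A sketch (the function name `dkw_band` is ours, not the `dkw` module's API):

```python
import numpy as np

def dkw_band(data, alpha=0.05):
    """Confidence band for the ECDF from the DKW inequality:
    epsilon = sqrt(ln(2/alpha) / (2n)) bounds the maximum
    distance between the ECDF and the true CDF."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    lower = np.maximum(y - eps, 0.0)  # clip band to [0, 1]
    upper = np.minimum(y + eps, 1.0)
    return x, lower, upper
```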
$\color{blue}{\textbf{Maximum likelihood estimation (MLE) for microtubule catastrophe data}}$
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data are most probable. The point in parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and the method has become a dominant means of statistical inference. Given its clear advantages in data analysis, we were interested in how our hypothesized distribution model would behave under such an assessment.
It is noted in the literature that both the Gamma distribution model and the two-Poisson-process model can be used in MLE calculations. We first looked at the Gamma distribution model, whose two parameters, $\alpha, \beta$, can be estimated directly and $\textbf{analytically}$ from the data set instead of numerically. This is because the Gamma distribution has the following form, and its two parameters can be estimated from the sample moments as:
\begin{align} f(y;\alpha, \beta) = \frac{1}{\Gamma(\alpha)}\,\frac{(\beta y)^\alpha}{y} \,\mathrm{e}^{-\beta y}\\ \end{align}\begin{align} \alpha &= \frac{mean^2}{variance}\\ \beta &= \frac{mean}{variance}\\ \end{align}Since the mean and variance of our data sets are easy to obtain, we can get estimates of $\alpha, \beta$ and their confidence intervals using bootstrapping without much effort.
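These moment-based estimates can be sketched as follows (the function name `gamma_params` is ours; the data below are synthetic, drawn from a known Gamma distribution as a sanity check):

```python
import numpy as np

def gamma_params(data):
    """Estimate the Gamma parameters from the sample moments:
    alpha = mean^2 / variance, beta = mean / variance."""
    mean = np.mean(data)
    var = np.var(data)
    return mean**2 / var, mean / var

# Sanity check on synthetic data with known alpha=2, beta=0.005
# (Numpy's gamma takes a scale, i.e. 1/beta)
rng = np.random.default_rng(seed=3252)
data = rng.gamma(2.0, 1 / 0.005, size=100_000)
alpha_est, beta_est = gamma_params(data)
```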
When dealing with the model we constructed, a difficulty arose: it is not easy to write the log-likelihood function for the two parameters, $\beta_1, \beta_2$, which is crucial for MLE. It took us a while to find a way to simplify the PDF so that the log-likelihood is easier to obtain. Under this strategy, we stipulate that $\beta_1 \leq \beta_2$ and $\Delta \beta = \beta_2 - \beta_1 \geq 0$. Assuming that all observations are independent and identically distributed, the likelihood function is:
\begin{align} L(\beta_1, \Delta\beta; t) = \prod_{i} \frac{\beta_1(\beta_1 + \Delta\beta)}{\Delta\beta}e^{-\beta_1 t_i} (1 - e^{-\Delta\beta t_i}), \end{align}and taking the logarithm of the likelihood function above gives:
\begin{align} \ell(\beta_1, \Delta\beta; t) = n \ln \beta_1 - n \ln \Delta\beta + n \ln (\beta_1 + \Delta\beta) - \beta_1 \sum_{i} t_i + \sum_{i} \ln \left(1 - e^{-\Delta\beta t_i} \right) \end{align}In this way, we defined the function log_like_poisson_exp(params, t), which computes the log-likelihood using the mathematical model above.
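The function named in the text can be sketched as follows; the numerical-optimization wrapper, its starting guess, and the choice of Nelder-Mead are our assumptions, not the original code:

```python
import numpy as np
import scipy.optimize

def log_like_poisson_exp(params, t):
    """Log-likelihood of the two-step ('double exponential') model,
    parametrized by beta_1 and delta_beta = beta_2 - beta_1 >= 0."""
    beta_1, d_beta = params
    if beta_1 <= 0 or d_beta <= 0:
        return -np.inf  # parameters out of bounds
    n = len(t)
    return (
        n * np.log(beta_1)
        - n * np.log(d_beta)
        + n * np.log(beta_1 + d_beta)
        - beta_1 * np.sum(t)
        + np.sum(np.log(1 - np.exp(-d_beta * t)))
    )

def mle_poisson_exp(t, guess=(0.003, 0.005)):
    """Numerical MLE by minimizing the negative log-likelihood."""
    res = scipy.optimize.minimize(
        lambda p: -log_like_poisson_exp(p, t), x0=guess, method="Nelder-Mead"
    )
    return res.x
```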
Finally, we derived the MLE of $\beta_1$ and $\Delta \beta$, obtaining 0.0041127 and 0.00090066, respectively. Theoretically, the expectation of this 'double exponential' distribution can be derived by integrating $t \, f(t)$ over $t \in (0, \infty)$, which gives $\langle t \rangle = 1/\beta_1 + 1/\beta_2$ with $\beta_2 = \beta_1 + \Delta\beta$.
To test whether our MLE is correct, we compared the mean predicted by our model with the mean of the data set. Comparing the calculated mean ($442.63 \ s$) with the observed mean ($440.71 \ s$), we find that our strategy is precise and that the 'double exponential' distribution could be a good model for catastrophe times.
$\color{blue}{\textbf{Different concentrations of total tubulin markedly influence the dynamics of microtubule catastrophe}}$
To study how the concentration of total tubulin influences the dynamics, we used a series of concentrations (7 to 14 $\mu \text{M}$). This range was chosen to minimize the number of missed catastrophe events so as not to bias the catastrophe length and time distributions. For example, below 7 $\mu \text{M}$ the microtubules tend to be short, and it is difficult to accurately measure the shortest ones. Above 14 $\mu \text{M}$ the microtubules tend to be long and to bend away from the surface (due to thermal fluctuations), making it difficult to accurately measure the longest ones in the TIRF assay.
hw9_1.show_ecdfs_dif_con()
Figure 7 ECDF of time to catastrophe (s) for different tubulin concentrations
In $\textbf{Figure 7}$, the ECDFs show that the higher the tubulin concentration ($\mu$M), the farther the curve shifts to the right. This illustrates that the time to catastrophe (s) increases as the tubulin concentration increases.
$\color{blue}{\textbf{Model assessment for microtubule catastrophe time}}$
We previously discussed two models, the Gamma distribution model and the 'double exponential' model; however, we could not yet determine which model is better from the ECDFs and MLE alone. Finally, therefore, we focused on how to identify the better model.
Typically, a $\textbf{Q-Q plot}$, a graphical method for comparing two probability distributions by plotting their quantiles against each other, is a good way to illustrate the correspondence between a generative model and the observed quantities.
hw9_1.show_qq_gamma_12()
hw9_1.show_qq_dp_12()
Figure 8 Q-Q plot for the Gamma distribution at 12 μM concentration (top)
Figure 9 Q-Q plot for the Poisson model at 12 μM concentration (bottom)
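One way to build such a Q-Q plot is to compare the sorted observed data against the typical order statistics of samples drawn from the generative model; the sketch below is our own construction (function name `qq_points` is hypothetical, not the `hw9_1` implementation):

```python
import numpy as np

def qq_points(data, alpha, beta, n_samples=1000, seed=3252):
    """Q-Q points: sorted observed data against the median of the
    order statistics of samples from Gamma(alpha, beta) (rate
    parametrization, so Numpy's scale is 1/beta)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    samples = np.sort(rng.gamma(alpha, 1 / beta, size=(n_samples, n)), axis=1)
    theor = np.median(samples, axis=0)  # typical j-th order statistic
    return theor, np.sort(data)
```

Plotting `theor` against the sorted data, points near the diagonal indicate agreement between model and observations.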
Clearly, both generative models can produce the observed data, since the Q-Q plots show no strong separation of the generative quantiles from the observed quantiles. However, the similarity of the Q-Q plots fails to tell us which model is better.
Besides the Q-Q plot, another efficient method to compare two models is the $\textbf{predictive ECDF}$. If we wish to accept a model, with the MLE for its parameters, as a good representation of the true generative model, we would like to see how data generated from that model would look. We can do a parametric bootstrap to generate data sets and then plot confidence intervals of the resulting ECDFs.
With this strategy, we obtained $\textbf{Figure 10}$ and $\textbf{Figure 11}$, where we can already see the difference between the two models when the time to catastrophe is large.
hw9_1.show_pe_gammar_12()
hw9_1.show_pe_dp_12()
Figure 10 Predictive ECDF for the Gamma distribution at 12 μM concentration (top)
Figure 11 Predictive ECDF for the Poisson model at 12 μM concentration (bottom)
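The parametric-bootstrap band behind such predictive ECDFs can be sketched as follows for the Gamma model (our own helper, not the `hw9_1` code; parameter values when calling it would come from the MLE):

```python
import numpy as np

def predictive_ecdf_band(alpha, beta, n, n_bs=1000, ptiles=(2.5, 97.5), seed=3252):
    """Parametric bootstrap predictive ECDF band: draw n_bs data
    sets of size n from Gamma(alpha, beta) (rate parametrization,
    so Numpy's scale is 1/beta), evaluate each sample's ECDF on a
    common grid, and return percentile bands of the ECDF values."""
    rng = np.random.default_rng(seed)
    samples = rng.gamma(alpha, 1 / beta, size=(n_bs, n))
    grid = np.linspace(0, samples.max(), 200)
    # ECDF of each bootstrap sample evaluated on the grid
    ecdfs = (samples[:, :, None] <= grid[None, None, :]).mean(axis=1)
    low, high = np.percentile(ecdfs, ptiles, axis=0)
    return grid, low, high
```

Overlaying the observed ECDF on this band shows whether the data stay within what the fitted model can plausibly generate.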
Zooming in on the predictive ECDFs of the two models, it is clear that the Poisson model gives a worse fit in both the small- and large-time regions. From this we can conclude that the Gamma distribution should be the better model for microtubule catastrophe at a concentration of 12 μM. However, this is only what we see from the figures; the most rigorous way to draw the conclusion is through numerical values, and the Akaike information criterion provides the solution.
The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby of the relative quality of statistical models for a given data set. Given a collection of models for the data, AIC estimates the quality of each model relative to the others, and thus provides a means of model selection in mathematical form based on the AIC weight, $w$. To test whether the Gamma distribution model is better than the Poisson model, we performed an AIC assessment of the two models.
As computed, the AIC weight $w$ for the Gamma distribution is essentially 1, while $w$ for the Poisson model is almost 0, suggesting that the Gamma distribution model is strongly preferred for the 12 μM time-to-catastrophe data set. In this way, we proved not only visually but also numerically that the Gamma distribution model is strongly preferred.
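The AIC weights follow from the maximized log-likelihoods: $\mathrm{AIC} = -2\ell_{\max} + 2p$ for a model with $p$ parameters, and $w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}$ where $\Delta_i$ is each model's AIC minus the minimum. A sketch (the log-likelihood values below are illustrative, not our measured ones; both models here have two parameters):

```python
import numpy as np

def akaike_weights(log_likes, n_params):
    """Akaike weights from maximized log-likelihoods and
    parameter counts: AIC = -2 * loglike + 2 * p."""
    aic = -2 * np.asarray(log_likes) + 2 * np.asarray(n_params)
    d = aic - aic.min()  # subtract the minimum for numerical stability
    w = np.exp(-d / 2)
    return w / w.sum()

# Illustrative values: a 30-unit log-likelihood gap (delta AIC = 60)
# puts essentially all the weight on the first model
w_gamma, w_double_exp = akaike_weights([-1000.0, -1030.0], [2, 2])
```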
Hence, it is reasonable to apply the Gamma distribution model to the other tubulin concentrations. To compare how $\alpha$ and $\beta$ vary as the tubulin concentration changes, we plotted the confidence regions of $\alpha$ vs. $\beta$. As clearly shown in $\textbf{Figure 12}$, $\alpha$ changes much more than $\beta$ as the concentration changes. Since $\alpha$ increases significantly while $\beta$ changes little, and the overall catastrophe time increases, we concluded that increasing the concentration of [GTP] largely increases $\alpha$, which suggests that more steps are needed for catastrophe and that the time to catastrophe becomes larger. This accords with the idea that hydrolysis of GTP provides energy for tubulin assembly, so less GTP (less energy) leads to microtubule catastrophe.
hw9_1.show_alpha_beta_region()
Figure 12 Confidence intervals of the α, β plane for different tubulin concentrations
$\color{brown}{\textbf{CONCLUSIONS}}$
In conclusion, we compared two mathematical models for the microtubule catastrophe time: the Gamma distribution model suggested by Howard and co-workers, and the double exponential model we proposed. First, we eliminated possible influences resulting from light-induced damage by utilizing DIC microscopy along with TIRF imaging; ECDF and confidence-interval comparisons demonstrated that our results are not an artifact of imaging or growth conditions. Model assessment was then carried out with the Q-Q plot methodology, predictive ECDFs, and AIC analysis, which showed that the Gamma distribution model is a better model for microtubule catastrophe than the double exponential model. This work provides a systematic approach for comparing two models both visually and numerically, and it offers a pathway for estimating the correlation between series of data sets. In addition, we incidentally found that increasing the tubulin concentration and [GTP] stabilizes the microtubule, which may result from energy supply. Further analysis of microtubules and the catastrophe process is ongoing in our team.
$\color{brown}{\textbf{ACKNOWLEDGEMENT}}$
The authors thank Prof. Justin Bois for helpful discussions and suggestions, well-organized lectures and course website. We thank all TAs for their help, insights and ideas.
$\color{brown}{\textbf{REFERENCES}}$
[1] For all the experimental data we analyzed in this paper, see:$\ \ \text{J. Howard et al.} \ \textit{Cell} \ 2011, 147, 1092–1103.$
[2] For reviews, see:$ \text{(a) D.A. Compton et al.} \ \textit{Curr. Biol.} \ 2009, 19, 1937–1942. \ \text{(b) J. Shanks et al.} \ \textit{Mol. Biol. Cell} \ 1996, 7, 663–675. $
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,iqplot,bebi103,jupyterlab
HW9.3: Ziyang wrote the solution to this problem; we all then took part in finalizing it.